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Abstract 

Aims. An efficient algorithm for calculating radiative transfer on massively parallel computers using domain decomposition is presented. 
Methods. The integral formulation of the transfer equation is used to divide the problem into a local but compute-intensive part for calculating 
the intensity and optical depth integrals, and a nonlocal part for communicating the intensity between adjacent processors. 

Results. The waiting time of idle processors during the nonlocal communication part does not have a severe impact on the scaling. The wall 
clock time thus scales nearly linearly with the inverse number of processors. 
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1. Introduction 

Over the past one or two decades tremendous advances have 
been made in achieving high resolution power in computational 
astrophysical fluid dynamics; see Haugen et al. ( 2003 ) for a 
1024 3 simulation of hydromagnetic turbulence and Kaneda et 
al. ( 120031 for a 4096 3 simulation without magnetic fields. Such 
high resolution is now possible mainly due to the availability of 
massively parallel computers allowing superb performance at 
a low price, especially if off-the-shelf personal computers can 
be interconnected using standard Ethernet switches. This poses 
no major difficulty for the usual finite difference schemes that 
allow the computational domain to be decomposed into smaller 
sub-domains, because the necessary communication between 
processors is limited to a small neighborhood of the processor 
boundaries. 

Radiative transfer calculations fall generally outside this 
class of problems, because the transfer equation is intrinsically 
nonlocal. Physically speaking, information travels at the speed 
of light when the gas is optically thin. Thus, from one time step 
to the next, the change in intensity in the domain of one pro¬ 
cessor can affect the radiation field on many other processors 
even if they are far apart. 

In this paper we describe a simple method that renders the 
transfer problem essentially local - at least as far as the bulk of 
the computational cost is concerned. By using the integral for¬ 
mulation of the transfer equation, the intensity may be written 
as a local integral term plus an attenuated boundary term. The 
local integral term may be computed in parallel by all proces¬ 


sors whereas only the boundary term, which may be applied 
after the integrals have been computed on all processors, re¬ 
quires communication between processors. 

The attenuated boundary terms only need to be computed 
on and communicated across those boundaries where the radi¬ 
ation leaves each sub-domain (hereafter referred to as down¬ 
stream boundaries). The corresponding update of the interior 
of each sub-domain may be carried out afterwards and again is 
a completely local operation. 

The efficiency of our parallelization method depends heav¬ 
ily on how rapidly the attenuated boundary terms may be ob¬ 
tained during the communication step. Due to this dependency, 
our method is only practical if the radiation is strictly along 
straight lines and does not diffuse in the transverse direction 
via interpolation, as in the short characteristics method, for ex¬ 
ample. This is discussed in detail in Sect. 13.31 

Our technique may be contrasted with other popular ap¬ 
proaches to solving the transfer equation in decomposed do¬ 
mains. In the multiple wavefront method (Nakamoto et al. 
12001 1. parallel efficiency is achieved by allowing different ray 
directions to be treated simultaneously, trying to avoid multi¬ 
ple tasks for the same processor and minimizing the number of 
idle processors. This method does not impose any restrictions 
on the radiative transfer scheme, but efficient parallelization re¬ 
quires a large number of ray directions and frequencies to be 
treated. If a non-diffusive scheme is employed, our technique 
appears simpler and remains efficient even when the number 
of CPUs greatly exceeds the number of ray directions and fre¬ 
quencies. 
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2. The transfer equation 

Radiation couples with the equations of fluid dynamics both 
through radiative heating and cooling and, if the temperatures 
are high enough, through radiative pressure. For the applica¬ 
tions that we currently have in mind (e.g. stellar convection and 
protostellar accretion discs), only heating/cooling is important, 
so V • F enters the energy equation 

De 

P~ + ' “ + V ' F = ^ diss ’ W 

where p is the mass density, p is the pressure, u is the fluid ve¬ 
locity, e is the internal energy per unit mass, F is the energy 
flux, and gdiss is the (kinetic and/or magnetic) energy dissipa¬ 
tion. For radiative energy transfer the energy flux is given by 


F = 


4/r 


'/■ 


dFl I dv hl v (h). 


( 2 ) 


where I is the specific intensity giving the amount of energy 
transported by radiation per unit frequency range per unit area 
per unit time into a solid angle Q in the direction h. 

To determine the specific intensity one has to solve the 
transfer equation (e.g. Mihalas & Weibel-Mihalas 11984k 


n ■ V/ v — V■ ( S i A ). 


(3) 


where ft is the unit vector in the direction of propagation, Xv 
is the opacity (per unit volume) or inverse mean free path of 
a photon, and S is the source function, which gives the ratio 
between emission and absorption. 

The transfer equation <□ is here written in its time inde¬ 
pendent form. This is appropriate for the non-relativistic case, 
where the maximum fluid velocity is much lower than the speed 
of light. The flux divergence is then given by 


V F = 



dv Xv(Sy - Iy). 


(4) 


FollowingNordlund (1982), we define Q v - S v -I v , giving 
the cooling rate per ray direction and infinitesimal frequency 
interval, and the optical depth scale r y = fxvds, where s is 
measured along the propagation direction of the ray. It is then 
possible to rewrite m as 

dQv _ dS y 

r — i Sdv W J 

ClTy dTy 

This equation may also be written in integral form, 


Q(t) = Q(r 0 )e T ' 


Jr„ dT’ 


e (inlr, (r) 


(6) 


integro-differential equation which has to be solved by means 
of an iterative scheme, such as Accelerated Lambda Iteration 
(see Olson et al. 11986 ). However, during each iteration step the 
source function is given (i.e. taken from a previous iteration 
step) and in the following we may - without loss of generality 
- assume that the source function is independent of intensity. 


3. The radiative transfer scheme 


For the sake of simplicity we here assume that the set of ray di¬ 
rections is chosen in such a way that all rays travel directly 
through neighboring grid points. This will suffice to moti¬ 
vate our method and we delay the discussion of interpolation 
schemes for solving the transfer equation for arbitrary ray di¬ 
rections until Sect. 13.31 

With the above assumption, it is in principle easy to solve 
0 for all ray directions. Given the cooling rate at a mesh point 
n— 1, the discretization of © enables us to compute the cool¬ 
ing rate at the next mesh point n in the direction of the ray. 
Once a given boundary condition is adopted, it is thus possible 
to determine the cooling rate in the entire simulation box by 
stepping successively along the ray. 

However, in the case of domain decomposition, only those 
processors adjacent to a boundary of the simulation box are 
able to immediately compute the correct cooling rate within 
their sub-domain. All other processors have to wait until they 
are provided with boundary information from a neighboring 
processor that already has determined the correct cooling rate. 
Without further sophistication, this would imply that most 
computation related to the radiative transfer problem is not car¬ 
ried out in parallel and valuable CPU time is spent in waiting. 

Fortunately, for a local source function (e.g. independent 
of mean intensity), the integral term G (lntr) in represents a 
valid solution of the transfer equation within each sub-domain, 
apart only from the contribution from the upstream boundary. 
We call this the intrinsic solution. 


Qn 


Wmtr) 

xZn-1 


St,,- 




(7) 


with 


Go lntr) = o and dr„_ i/2 = r„ - t„_, . (8) 

The complete solution for an arbitrary boundary condition Qq 
may be obtained by simply adding the correction term Qo e To ~ T " 
to the intrinsic solution on all inner points, 

Qn = Qoe T °~ T " + gj 1 ,nlr) . (9) 


In order to reduce the idle time of the individual processors 
we split the calculation of the cooling rate into three distinct 
parts, two of which may be carried out by all processors in 
parallel. How this works in detail is illustrated in the following. 


where the explicit reference to the frequency v has been 
dropped. By using Q instead of I numerical precision is re¬ 
tained even when the optical depth is very high and I ap¬ 
proaches S very closely. 

In general the source function S may depend on the in¬ 
tensity itself, i.e. on a nonlocal quantity, turning into an 


3.1. Non-periodic boundaries 

We first assume that our computational domain is non-periodic 
in all spatial dimensions. As far as radiative transfer is con¬ 
cerned, this is the simplest case - the periodic case is treated in 
the next subsection. 
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Q(t) = Q 0 e T °- T + 

/ e T '- T S\T')dT' 
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Figure 1 . Illustration of the radiative transfer scheme. See text 
for details. 


The basic procedure to obtain the solution to the radia¬ 
tive transfer problem in decomposed domains is illustrated in 
Fig n For the sake of simplicity, the computational domain is 
two-dimensional and divided into four sub-domains. The trans¬ 
fer equation is solved forrays along the direction indicated. The 
generalization to three-dimensional domains is trivial. 

In the following, Q o refers to the cooling rate on the local 
upstream boundary of a given sub-domain, which either co¬ 
incides with the boundary condition of choice for the entire 
computational domain (global boundary) or overlaps with the 
downstream boundary of a neighboring processor in the direc¬ 
tion opposite to the ray. 

The first step is to obtain the intrinsic solution Q llntr> within 
each sub-domain, assuming a vanishing cooling rate at the 
boundary. This corresponds to evaluating the integral in Eq. 

For each point the cooling rate and the optical depth t-tq are 


stored. This step can be carried out by all processors in paral¬ 
lel since no information is required from outside the processor 

(Rg. in?). 

The communication part follows next. The local boundary 
cooling rate Qq in the lower ghost zone of processors 3 and 4, 
as well as in the left ghost zone of processors 1 and 3 are given 
by the global boundary condition of choice (Fig. m. Since all 
its upstream boundaries are set, processor 3 can immediately 
compute the correct cooling rate on its upper and right bound¬ 
aries by adding the correction term <2o e T °~ TN to the cooling rate 
obtained from the intrinsic solution (Fig.[0l). Here t o and Qq 
refer to a point in the left (lower) ghost zone and r, v to the cor¬ 
responding point at the upper (right) boundary along the ray. 

Now that the correct cooling rate on the upper (right) 
boundary of processor 3 is available, this information is com¬ 
municated to processor 1 (4) where the boundary condition in 
the lower (left) ghost zone is set (Fig.Qj). Fikewise, processor 
1 (4) is now able to compute directly the cooling rate on its 
right (upper) boundary and can send the values to processor 2 
(Figs, nr and g). 

In Fig. nfe all information necessary to solve the full trans¬ 
fer equation on every point on all processors is available and 
the communication part is finished. The last step is again car¬ 
ried out by all processors in parallel and independently of 
each other. It amounts to simply adding the correction term 
O 0 e T °~ T ", this time on all inner points in the sub-domain 

(Fig. Eh). 

3.2. Periodic boundaries 

For many applications it is convenient to assume periodicity of 
the simulation box in one or more spatial directions. An exam¬ 
ple is convection in an infinitely extended plane-parallel layer. 
While this is trivial to implement for the HD- and MHD-part 
of a scheme, periodicity introduces a potential difficulty to the 
radiative transfer scheme. 

In the non-periodic case there is always at least one pro¬ 
cessor where all upstream boundaries are entirely set, once 
the boundary condition for the whole simulation box has been 
used. In the example setup of the previous subsection this 
would be processor 3. By determining the cooling rate on its 
downstream boundaries and communicating to all its neigh¬ 
bors, all upstream boundaries of these neighbors are entirely 
set, and so on. This implies that each processor has to propa¬ 
gate boundary values only once through its domain. 

In contrast to the above, in the case of periodicity, it might 
become necessary to propagate boundary values several times 
through each sub-domain. This is illustrated in Fig. |3 The 
computational box in this example is taken to be periodic in the 
horizontal direction, so that only the heating rates in the lower 
ghost zones of processors 3 and 4 are known a priori from the 
global boundary condition (Fig. Eb) Without communication, 
this information does not suffice to entirely cover any of the 
downstream boundaries of processors 3 or 4 for the given ray 
direction (Fig. Et) 

In order to cover all up- and downstream boundaries of pro¬ 
cessors 3 and 4, it is necessary to communicate the available 
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Q(r) = Qoe To ~ T + 
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Figure 2. Same as Fig. ^ except that here the computational 
domain is periodic in the horizontal direction. Note that the ray 
travels at an inclination of 22.5° relative to the horizontal axis. 


downstream heating rates along the periodic direction several 
(in this case two) times (Figs. EU-H). 

3.3. Remarks on interpolation 

So far we have only considered rays that pass directly through 
neighboring grid points. In general we would like to take into 
account rays with arbitrary inclination, so we have to interpo¬ 
late in the angular direction. According to Stone et al. ( 1992 :i 
there are essentially two approaches to this problem which we 
discuss now briefly, taking into account their compatibility with 
our proposed parallelization method. 

The first approach we consider is to set up exactly one ray 
per grid point and trace it all the way back to the boundary of 
the computational domain, interpolating the values for opac¬ 


ity and source function along the way. This is usually called 
the method of long characteristics. For large numerical grids in 
two (three) spatial dimensions, the above prescription is, how¬ 
ever, hardly ever used in this form as it scales as A 3 (A 4 ) with 
the number of grid points A. A common cure for this problem 
is to introduce in addition to the ordinary hydrodynamic al (HD) 
grid an arbitrarily inclined radiation grid for each ray direction, 
with grid points along a number of parallel rays in that direc¬ 
tion (e.g. Nordlund 1982; Stein & Nordlund 1988; Razoumov 
& Scott 1999; Juvela & Padoan 2005). The values for opacity 
and source function are interpolated from the HD grid onto the 
radiation grid, the transfer equation is solved along each of the 
parallel rays, and the solution (in terms of the radiative cooling 
rate) is finally interpolated back onto the HD grid. 

This approach is well suited for incorporation into our par¬ 
allelization method because the radiation does not diffuse out 
on the radiation grid. Hence, a ray that reaches a point on 
the downstream boundary of a processor’s sub-domain can be 
traced back to a unique point at the upstream boundary and 
the attenuated boundary terms may thus be rapidly computed 
during the communication step of our method. This is cru¬ 
cial for keeping the idle times of the individual processors at 
a minimum. Furthermore, interpolation between the two sep¬ 
arate grids is a completely local operation that can be carried 
out by all processors in parallel before and after the communi¬ 
cation step. Thus, the full advantage of our method can still be 
exploited. In fact, because it is a local operation, angular inter¬ 
polation actually improves our method’s scaling with the num¬ 
ber of processors since the communication time then becomes 
- relative to the overall expense of obtaining the full solution 
to the transfer equation - even less significant. 

The other approach is to use the method of short character¬ 
istics (e.g. Kunasz & Auer 19881 Auer & Paletou l 1994 1 Auer et 
al.|T333». In this method, the rays are cut off at cell boundaries, 
and the radiation intensity is interpolated onto neighboring grid 
points. As a result, the radiation along rays that do not travel di¬ 
rectly through grid points diffuses away from the exact down¬ 
stream direction. Due to this diffusion, the radiation reaching 
one particular grid point on the downstream boundary of a pro¬ 
cessor’s sub-domain depends in a highly non-trivial (unphysi¬ 
cal) manner on the radiation coming from a substantial number 
of grid points on its upstream boundary. In order to propagate 
the boundary radiation values through downstream processors 
one would thus essentially have to repeat the radiative transfer 
solution again, but now with each processor dependent on its 
upstream neighbor. This would remove the advantage of being 
able to do most of the work independently on each processor. 
The short characteristics method is thus not suitable for paral¬ 
lelization with the method presented here. 


3.4. Periodic rays 

For rays traveling along a periodic direction, there is no bound¬ 
ary condition to start from. However, writing the transfer equa¬ 
tion in its integral form. 


Qn = Go e To ~ TN 


J r'Tjv 

e r-r N 

TO 



( 10 ) 
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where the subscripts 0 and N refer to corresponding points on 
opposite sides of the simulation box, and using the periodicity 
condition Q # = Go, it is possible to solve for the cooling rate 

Go, 

r T N jc 

(1 _ e To ~ TN ) Go = e T ~ TN —dT. (11) 

Jto dr 

In a decomposed domain we may thus obtain the full periodic 
ray solution in much the same way as in the previous section 
once Go is known. However, additional communication is re¬ 
quired to calculate Go itself, which means that we have to com¬ 
municate twice through the simulation box. 

The periodic ray solution is illustrated in Fig. 0 for a ray 
traveling in the positive x-direction across 4 processors. The 
source function and opacity profile are depicted in the two up¬ 
permost panels, followed by the intrinsic cooling rate on each 
processor, the corresponding corrections and the final cooling 
rate. 

4. Implementation into the Pencil Code 

The method for solving radiative transfer problems, as outlined 
in the previous section, has been implemented into the Pencil 
Code 1 . This code has been specifically designed for turbulence 
simulations in a parallel computing environment using the do¬ 
main decomposition scheme. 

The numerical scheme consists of a third order Runge- 
Kutta method due to Williamson (11 980! ) for the time step¬ 
ping and sixth order centered finite differences in space; see 
Brandenburg for details. The code is able to do domain 

decomposition in two spatial dimensions (y- and ^-direction) 
using the Message Passing Interface (MPI) for interprocessor 
communications. 

For the numerical solution of the transfer equation we ap¬ 
proximate the source function by a second order polynomial in 
optical depth (see Bruls et al. 119991 . The integral in Eq. © 
may then be solved exactly. Numerical details are given in 
Appendix | aJ available at the CDS. It is however worth noting 
here that the intrinsic solution where an arbitrary but definite 
boundary condition is employed may equally well be obtained 
by virtue of Feautrier’s (see Mihalas fl 9781 or any other suitable 
method. 

In fact, we have found in another context that on most 
(but not all) CPUs, the Feautrier method is faster by up to 
about a factor of two, relative to the most optimized integral 
method (see Appendix |A] for details). However, since the in¬ 
tegral method is more general (applicable for example also in 
cases with a combination of Doppler effect and polarization in 
spectral lines), we choose to present its implementation here. 

When solving the intrinsic part of the transfer equation we 
store the difference in optical depth between all grid points and 
the upstream boundary in a 3-dimensional array. This allows 
us to quickly compute the attenuated boundary terms and add 
them to the intrinsic cooling rate on the downstream boundary 
(during the communication step) as well as to all non-boundary 
grid-points (afterwards). 

1 http://www.nordita.dk/software/pencil-code 
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Figure 3. Illustration of the periodic ray solution for one ray 
traveling in the positive x-direction across 4 processors. 


5. Benchmarks results 

The scaling of our method with the number of processors de¬ 
pends on how many processors that can simultaneously com¬ 
pute the attenuated cooling rates on the downstream boundaries 
of each processor. This in turn depends on 

- the number of ray directions. 

- the type of boundary condition (periodic or non-periodic), 

- the shape (in terms of grid points) and distribution of sub- 
domains 
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Table 1. Distribution of sub-domains in y and z for up to 8 
processors. 


«cpu 12 4 8 


planar 

sub-domains 


columnar 

sub-domains 


To investigate the above dependencies we have performed 
a series of benchmarks on the Beowulf cluster at the Danish 
Center for Scientific Computing. All our tests ran on a sub¬ 
network with 302 Intel Pentium 4 machines with 3.2 GHz 
CPUs and 1 GB memory, connected with Gigabit Ethernet and 
10-Gigabit uplinks. On the software side, we used the Intel 
Fortran 8.1 compiler and the LAM MPI implementation. 

In the benchmark series the number of processors, N, in¬ 
creases from 2 to 128 in powers of two. To be representative of 
typical applications of our scheme, each benchmark is a short¬ 
lived 3-D hydrodynamic al simulation of a (gray) solar atmo¬ 
sphere near the surface. As is explained in Appendix|BJ avail¬ 
able at the CDS, the ionization fraction that enters the equation 
of state is calculated in an iterative fashion from the thermody¬ 
namic variables. We use the ionization fraction to calculate the 
number density of negative hydrogen ions, which are the only 
source of opacity in these simulations. 

To find out whether the choice of boundary conditions for 
the radiative cooling rate has a significant influence on the scal¬ 
ing, the entire computational domain is either periodic or non¬ 
periodic in the horizontal x- and v-directions. The boundary 
condition in the ^-direction is in both cases non-periodic. 

Furthermore, we have taken into account two different sub- 
domain shapes, characterized by the number of grid points in 
each spatial direction: “planar” sub-domains with m x = m y = 64 
grid points in the x- and y-directions and m z — 32 grid points in 
the ’-direction, and “columnar” sub-domains with m x = m z = 64 
grid points in the x- and ’-directions and m y = 32 grid points in 
the y-direction. Depending on their shape, these sub-domains 
are distributed in the following way. If N is an even power of 
two, there are as many processors in the y-direction as there are 
in the z-direction (N y = N z ). If N is an odd power of two, then 
for planar domains there are twice as many processors in the 
Z-direction as there are in the y-direction (N z = 2 N y ), and for 
columnar domains it is the other way around (N y = 2N z ). This 
is illustrated in TableHfor up to eight processors. 

In our present implementation where we only solve for rays 
that travel through grid points, there are three qualitatively dif¬ 
ferent sets of ray directions determined by the geometry of a 
grid cell: 

- Rays along coordinate axes (6 ray directions). 
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Figure 4. Wall clock time per grid point and time step in a 3- 
dimensional simulation of the solar surface for different sets of 
ray directions - in this case for columnar sub-domains and non¬ 
periodic boundary conditions for the cooling rate. The straight 
lines denote least square fits. Perfect scaling would be repre¬ 
sented by a horizontal line. 


- This plus rays along face diagonals (6 + 12-4 = 14 ray 
directions). 

- This plus rays along the space diagonals (14 + 8 = 22 ray 
directions). 

Rays along face diagonals only make up 12-4 = 8 additional ray 
directions since rays along those face diagonals that lie in the 
horizontal plane are impractical to solve for if periodic bound¬ 
ary conditions are employed, and, for comparison, we left them 
out in the case of non-periodic boundary conditions as well. 

The reason we carried out separate benchmarks for each of 
the above sets is that one could expect a decrease in scaling 
performance as one goes from 6 ray directions to 22. For rays 
along the edges of a grid cell there is always a whole layer of 
processors that can simultaneously receive the boundary cool¬ 
ing rate and propagate the attenuated cooling rate computed at 
the downstream boundary to the next layer. For inclined rays 
however, the number of processors that can do the communi¬ 
cation simultaneously varies as one communicates through the 
entire domain and is generally less or equal than for rays along 
the edges. For comparison with ordinary hydrodynamical do¬ 
main decomposition, we also included a benchmark with no 
radiative transfer at all for each type of boundary condition and 
sub-domain shape. In total, there are thus 4x2x2= 16 bench¬ 
marks per processor number. 

In Fig.[4]we show the scaling of the wall clock time per grid 
point and time step in a 3-dimensional simulation of the solar 
surface for different sets of ray directions using columnar sub- 
domains and non-periodic boundary conditions for the cooling 
rate. Perfect scaling would be represented by a horizontal line 
in this figure. Due to the locality of the Navier-Stokes equa¬ 
tions, the purely hydrodynamic benchmark series (0 rays) are 
close to being perfect in the above sense. In comparison, the 
radiative benchmarks (6, 14, and 22 rays) show for each type 
of boundary condition and sub-domain shape a slight increase 
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Figure 5. Least square fits to the wall clock time per grid point 
and time step for all sets of ray directions, boundary conditions 
for the cooling rate, and sub-domain shapes. 


in the wall clock time with the number of processors, but gen¬ 
erally with at most about 30% as «cpu varies from 1 to 128. 

In Fig. □ we compare the scaling behavior for different 
shapes of the sub-domains, both for periodic and non-periodic 
rays. The main conclusion to be drawn from this is that the 
difference in performance is surprisingly small. Nevertheless, 
as a general trend one can say that the scaling is slightly bet¬ 
ter for planar than for columnar sub-domains. On the other 
hand, columnar sub-domains perform slightly better for a small 
number of processors («cpu L 8). Within the accuracy of the 
measurements, which is not perfect due to constantly varying 
network and I/O performance on the cluster in the course of 
the benchmark series, we can say that neither the number of 
ray directions, nor the choice of boundary conditions and sub- 
domains shape has an appreciable impact on scaling with the 
number of processors. 

6. Conclusions 

In this paper we have presented a parallelization method for 
carrying out hydrodynamic al radiative transfer calculations on 
massively parallel computers using the domain decomposi¬ 
tion scheme. The proposed method is conceptually simple and 
straightforward to implement. It is also flexible in the sense that 
the solution of the transfer equation is not limited to the inte¬ 
gral method but may also be obtained by direct discretization 
(Feautrier type methods). 

We find that the proposed parallelization method scales al¬ 
most linearly with inverse number of processors, irrespective of 
the choice of boundary conditions, sub-domain shape, or num¬ 
ber of ray directions. The method is thus ideal for carrying out 
large hydrodynamical simulations on massively parallel super¬ 
computers using the domain decomposition scheme. 

The present implementation into the Pencil Code only rep¬ 
resents a first proof of concept. The inclusion of more ray direc¬ 
tions, a non-uniform vertical mesh, non-gray radiative transfer, 
radiation pressure, or scattering opacities are all examples of 


extensions that are still possible within the framework of our 
parallelization method. 

Our focus has not been to optimize the intrinsic part of the 
radiative transfer calculations, but already in the implementa¬ 
tion used for the benchmark series in Sect.|3 one can afford 22 
rays per mesh point at a cost of about 40% of the total time to 
advance the MHD-equations (60% relative to advancing only 
the HD-equations). With the fully optimized integral method 
(storing and reusing exponentials whenever possible), or with 
the Feautrier method, one can afford 2 to 4 times more rays per 
point, depending on the CPU and the type of network. 

The number of possible applications of our method is large. 
It has been tested successfully for simulations of the solar at¬ 
mosphere where even a gray opacity treatment with a small 
number of rays already gives quite useful results (but a more 
accurate model requires an opacity bin coverage, cf. Stein & 
Nordlund ! 1989 ). To give a list of other applications that is by 
no means exhaustive, the method is directly applicable to lo¬ 
cal simulations of accretions discs using the shearing sheet ap¬ 
proximations, to global models of stars or discs that are embed¬ 
ded in a Cartesian domain (e.g. Dobler et al. 120061 Freytag et 
al. l2002l . as well as to studies of radiatively driven ionization 
in HII regions. A generalization to time dependent radiative 
transfer would allow applications of the method to studies of 
the reionization of the Universe, as well as to other contexts 
where effects of the finite speed of light are important. 
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Appendix A: Numerical details 

In order to solve Eq. © one uses polynomial approximations 
for the opacity x and the source function S. In the current im¬ 
plementation for the Pencil Code, the opacity is assumed to 
vary linearly from a mesh point n to the next mesh point n +1 
that lies in the direction of the ray, so the difference in optical 
depth between these points, calculated by the trapezoidal rule, 
is 


6t„+ 1/2 = j(x,i + Xn+\) 6s n +l/2, (A.l) 

where 6 s n+ i /2 is the spatial distance between n and n+l. A more 
accurate (but also more expensive) method is based on a cubic 
spline fit, using the logarithmic derivative of the opacity, 

<5r„+ 1/2 = \\Xni. 1 + \d n 6 s) +Xn+ i(l - gr/«+idi)] 6 s, (A.2) 

where 


d n 



(A.3) 


and, for brevity, we write 6 s instead of <5 s„+i/ 2. While theoreti¬ 
cally this expression may yield negative values for 6t, for rea¬ 
sonably resolved temperature variations 6 s should not exceed 
unity (and certainly must not exceed 6), so in practice this does 
not happen. A third alternative is 


dr„ 


Xn+l Xn 

lnCr«+i/A«) 


ds n 


(A.4) 


but in practice this turns out to be less accurate than Eq. (IA.2> 
above. 

To calculate the cooling rate Qj, due to an ‘upwards’ di¬ 
rected ray, between t„ and t„+ i the quadratic Taylor approxi¬ 
mation for the source function is used, i.e. we ignore the third 
and higher derivatives of 5 (r). Equation © then gives 


the timings presented in Fig© were obtained without imple¬ 
menting this. 

For small values of 6t, the expressions for the coefficients 
a n , b„, and c„, must be computed in double precision to avoid 
loss of precision, but the final coefficients may be stored in 
single precision without noticeable loss of accuracy. On some 
CPUs further speed-up may be obtained by conditionally us¬ 
ing asymptotic expressions for these coefficients, while in other 
cases, especially where compiler options or special libraries are 
available to enable vectorization or other optimization of ex¬ 
ponentials, it may be faster to retain the explicit computation 
of exponentials. We have implemented coding that automati¬ 
cally chooses between these two alternatives, based on an ini¬ 
tial comparison of the speeds. 

The derivatives of S with respect to the optical depth r have 
to be computed on an irregularly spaced grid, i.e. 


/ 6S n ~ 1/2 dr„+i/2 + 6S n + 1/2 6T n -i/2 \l 
\(5t„_i/2 2 + 6t„+ i/2 2 // 


(A.9) 


\ 6t, ,+i /2 or,,_i /2 II 


(A. 10) 


where 6r n = (dr„_ 1/2 + 6t„+ i/2)/2 (see Bruls et al. il 999 ). The 
procedure to compute radiative energy transport in Cartesian 
geometry is then straightforward. For each ray direction there 
are in three dimensions - dependent on the inclination - either 
one, two or three upstream boundaries where a certain bound¬ 
ary condition is employed and the iteration as defined by JA.5> 
starts. By moving stepwise through the entire box it is possible 
to determine the cooling rate on every single mesh point and 
for all desired ray directions. 


Appendix B: Simulations of the solar atmosphere 


Oil ~ tin— 1/2 Qn— 1 1/2 ^ n —l Cn— 1/2 ^ n—l 

(A.5) 

with three coefficients 


«n- 1/2 = e~ ST "-' n , b n -\/i = 1 - e ‘ Sr ' , - I/2 , 

(A.6) 

and 


c„-i/ 2 = e-^Hl +tfr f ,_i /2 )- 1; 

(A.7) 

similarly, the ‘downwards’ directed rays give 


Qn — tln+l/2 Qn+l ~ b n -\-\/2 "t" Oj + 1/2 ^n+l‘ 

(A.8) 


One observes that in the limit of large optical depth, ((T + 
(Tj/2 —> - S" (diffusion limit). This demonstrates that the 
second derivative S" needs to be taken into account, as oth¬ 
erwise the numerically obtained total heating rates would be 
wrong (they would still be oc S " because up- and down-stream 
contributions do not cancel exactly, but with an incorrect and 
resolution-dependent coefficient). 

We point out that a factor of nearly two in computational 
speed may be gained in the radiative transfer part of the com¬ 
putations, at the expense of some storage space, by storing and 
reusing the e ~ Sr ”- 112 factors between two rays in opposite direc¬ 
tions. The speed increase was verified in smaller test cases, but 


The benchmark we used in Sect©is a short-lived simulation of 
the solar atmosphere near the surface without magnetic fields. 
For all benchmarks in the series (see Sect. 13 the physical size 
of the simulation box is 6 Mm x 6 Mm x 6 Mm in x, y, and 
z respectively. To account for the varying resolution of the nu¬ 
merical grid during the benchmark series, the viscosity is ap¬ 
propriately adjusted. The simulation box is periodic in both x 
and y, even if the boundary condition for the radiative cooling 
rate is non-periodic as it is the case for half of the benchmark 
runs. 

Furthermore, we use an equation of state that accounts for 
hydrogen ionization but ignores the negative hydrogen ion, 
H , and hydrogen molecule formation. The hydrogen ioniza¬ 
tion fraction is obtained iteratively from entropy and density 
by solving Saha’s equation. The H opacity is then calculated 
from the number density of H , which is again calculated from 
the number density of electrons and neutral hydrogen. The 
number density of H is very small, so even though H~ is the 
most important (and, in this simulation, only) contributor to the 
opacity, ignoring it in the equation of state is justified. 








